---
title: "Achieving Statistical Significance with Control Variables and without Transparency"
author:
- Gabriel S. Lenz^[glenz@berkeley.edu. Travers Department of Political Science, University
  of California, Berkeley. 210 Barrows Hall, Berkeley, CA 94720-1950]
- Alexander Sahn^[asahn@berkeley.edu. Travers Department of Political Science, University
  of California, Berkeley. 210 Barrows Hall, Berkeley, CA 94720-1950]
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  pdf_document:
    citation_package: natbib
    fig_caption: yes
    keep_tex: yes
    number_sections: yes
  html_document:
    df_print: paged
  word_document: default
biblio-style: apsr
fontsize: 12pt
geometry: margin=1in
header-includes:
- \renewcommand{\familydefault}{\sfdefault}
- \usepackage{hyperref}
bibliography: covariates.bib
abstract: "How often do articles depend on suppression effects for their findings? How often do they disclose this fact? By suppression effects, we mean control-variable-induced increases in estimated effect sizes. Researchers generally scrutinize suppression effects as they want reassurance that researchers have a strong explanation for them, especially when the statistical significance of the key finding depends on them. In a re-analysis of observational studies from a leading journal, we find that over 30% of articles depend on suppression effects for statistical significance. Although increases in key effect estimates from including control variables are of course potentially justifiable, none of the articles justify or disclose them. These findings may point to a hole in the review process: journals are accepting articles that depend on suppression effects without readers, reviewers, or editors being made aware."
---


\newpage

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r load, echo=FALSE, message=FALSE, warning=FALSE,include = FALSE}
#Load Libraries
library(tidyverse)
library(knitr)
library(ggridges)
library(here)
library(readxl)
library(gridExtra)

wd <- here()
setwd(wd)

load("reanalysis.Rdata")

STATA <- t(read.csv("replication_commands.csv"))
colnames(STATA) = STATA[1, ] # the first row will be the header
STATA = as.data.frame(STATA[-1, ] )         # removing the first row.
STATA <- subset(STATA, select=-c(V18:V23)) 
STATA <- subset(STATA, author %in% bv_mv$author)

avg_num_covar <- mean(str_count(STATA$covariates, "\\S+"))
avg_num_fe <- mean(str_count(STATA$fixed_effects, "\\S+"))
	
#Add log+1 percent change Calculations
bv_mv <- bv_mv %>% 
  mutate(bv_bp = case_when(sign(bv_b)==sign(mv_b) ~ abs(bv_b),
                           sign(bv_b)!=sign(mv_b) ~ 0),
         mv_bp = case_when(sign(bv_b)==sign(mv_b) ~ abs(mv_b),
                           sign(bv_b)!=sign(mv_b) ~ abs(mv_b - bv_b)),
         #mv_bp = ifelse(mv_b<0, mv_b*-1,mv_b), #recode so that multivariate is always positive and 
         #bv_bp = ifelse(mv_b<0, bv_b*-1,bv_b), # if Multivariate is negative, Flip sign Of bivariate
         #mv_bp = ifelse(bv_bp<0, mv_bp+bv_bp,mv_bp), #If bivariate and multivariate change signs, which only happens if Bivariate is less than zero
         #bv_bp = ifelse(bv_bp<0, bv_bp+bv_bp,bv_bp),
         inflate2 = (log(mv_bp+2) - log(bv_bp+2))*100,
         inflate = (log(mv_bp + 1) - log(bv_bp + 1))*100,
         #inflate2 = (log(mv_bp+2) - log(bv_bp+2))*100,
         #inflate5 = (log(mv_bp+5) - log(bv_bp+5))*100,
         #inflate10 = (log(mv_bp+10) - log(bv_bp+10))*100,
         linflate = ifelse(inflate>0,log(inflate),log(inflate*-1)*-1),  
         linflate = ifelse(inflate==0,0,linflate), 
         inflate_p =  ifelse(sign(bv_b)!=sign(mv_b), (mv_p-1) - (1-bv_p), mv_p- bv_p ),
         inflate_lp = (log(mv_p+1)-log(bv_p+1))*100,
         inflate_se = (log(mv_se+1) - log(bv_se+1)) * 100,
         n_mv = mv_n/bv_n * 100,
         pinflate = abs((mv_b - bv_b) / bv_b) * ifelse(abs(mv_b)>abs(bv_b), 1, -1) * 100,
         pinflate_se = (mv_se - bv_se) / bv_se * 100)

#Merge in replication info spreadsheet
info <- read_excel("replication_info.xlsx", sheet = "Full")
merged <- inner_join(bv_mv, 
                     subset(dplyr::rename(info, author = first_author), status!="Excluded"), by="author")

# inflation summary stats
inflation.stats <- function(name, data){
  n <- nrow(data)
  mean <- mean(data$inflate)
  mean_r2 <- sum(data$inflate * ifelse(is.na(data$mv_r2), data$mv_r2p, data$mv_r2), na.rm=T) / sum(!is.na(ifelse(is.na(data$mv_r2), data$mv_r2p, data$mv_r2)))
  med <- median(data$inflate)
  g0 <- mean(data$inflate>0)
  g10 <- mean(data$inflate>10)
  mean_se <- mean(data$inflate_se)
  med_se <- median(data$inflate_se)
  g0_se <- mean(data$inflate_se>0)
  out <- c(name, n, mean, mean_r2, med, g0, g10, mean_se, med_se, g0_se)
  names(out) <- c('name', 'n', 'mean', 'r2 weighted mean', 'median', 'g0', 'g10', 'se_mean', 'se_median', 'se_g0')
  return(out)
}

inf <- rbind(inflation.stats('all', bv_mv), inflation.stats('experiments', subset(merged, type %in% 1:3)), 
                         inflation.stats('observational', subset(merged, type %in% c(0,4))),
                         inflation.stats('bivariate specification presented', subset(merged, no_covar_presented>0)),
                         inflation.stats('bivariate specification presented and ns', subset(merged, no_covar_presented>0 & bv_p>0.05)),
                         inflation.stats('bivariate p > 0.05', subset(bv_mv, bv_p>=.05 | sign(bv_b)!=sign(mv_b) )),
                         inflation.stats('bivariate p > 0.05 - observational', subset(merged, (type==0|type==4) & bv_p>=.05)),
                         inflation.stats('full specification .045 < p < .055', subset(bv_mv, mv_p>=.045 & mv_p<=.055)),
                         inflation.stats('different n in full specification', subset(bv_mv, bv_n!=mv_n)),
                         inflation.stats('experiments with no bivariate spec', subset(merged, type %in% 1:3 & no_covar_presented==0)),
                         inflation.stats('experiments with bivariate spec', subset(merged, type %in% 1:3 & no_covar_presented>0)),
                         inflation.stats('observational with no bivariate spec', subset(merged, (type==0|type==4) & no_covar_presented==0)),
                         inflation.stats('observational with bivariate spec', subset(merged, (type==0|type==4) & no_covar_presented>0)),
                         inflation.stats('observational with no bivariate spec and p > 0.05', subset(merged,  sign(bv_b)!=sign(mv_b) | bv_p>0.05 & (type==0|type==4) & no_covar_presented==0)),
                         inflation.stats('observational with no bivariate spec and p > 0.05 + wrong sign', subset(merged, (bv_p>0.05 | sign(bv_b)!=sign(mv_b)) & (type==0|type==4) & no_covar_presented==0)))



##Role of fixed effects
# merged_STATA <- inner_join(merged, STATA, by="author")
# merged_STATA$fixed_effects.x[is.na(merged_STATA$fixed_effects.x)] <- 0
# #density plot of p-value change by 's effects are no fixed effects
# ggplot()+
#   geom_density(data=subset(merged_STATA, (type==0|type==4) & no_covar_presented==0 & fixed_effects.x==1 ), aes(inflate_p)) +
#   geom_density(data=subset(merged_STATA, (type==0|type==4) & no_covar_presented==0 & fixed_effects.x!=1), aes(inflate_p), linetype="dashed") 
# #dot plots
# ggplot()+geom_dotplot(data=subset(merged_STATA, (type==0|type==4) & no_covar_presented==0 & fixed_effects.x!=1), aes(inflate_p), binwidth = 0.05)+coord_cartesian(xlim = c(-1, 0.1)) 
# ggplot()+geom_dotplot(data=subset(merged_STATA, (type==0|type==4) & no_covar_presented==0 & fixed_effects.x==1), aes(inflate_p), binwidth = 0.05)+coord_cartesian(xlim = c(-1, 0.1))
#ggplot()+ geom_histogram(data=merged, aes(mv_p),binwidth = 0.003)

#Statistics in text
#Percent not showing inflation from insignificance to significance
hdp <-round(as.numeric(inf[which(inf=='observational with no bivariate spec and p > 0.05 + wrong sign'),2])/as.numeric(inf[which(inf=='observational'),2])*100, digits = 0)

n_total<- nrow(subset(info, status=="Complete"))
n_experiment <- nrow(subset(merged, type %in% 1:3))
n_obs <- nrow(subset(merged, type %in% c(0,4)))
n_obs_show   <- nrow(subset(merged, type %in% c(0,4) & no_covar_presented>0))
n_obs_shownosig <- nrow(subset(merged,  (sign(bv_b)!=sign(mv_b) |  bv_p>0.05) & type %in% c(0,4) & no_covar_presented>0))
n_obs_noshow <- nrow(subset(merged, type %in% c(0,4) & no_covar_presented==0))
p_obs_noshow <- round(n_obs_noshow/n_obs*100, 0)
n_obs_noshownosig <- nrow(subset(merged,  (sign(bv_b)!=sign(mv_b) |  bv_p>0.05) & type %in% c(0,4) & no_covar_presented==0))
p_obs_nosigsig <- round(nrow(subset(merged, (type==0|type==4) & (bv_p>0.05 | sign(bv_b)!=sign(mv_b))&(mv_p<bv_p|sign(bv_b)!=sign(mv_b)))) /n_obs*100,0)  # Switch from nonsignificant to significant in observational studies plus one that changes sign from significant to significant
p_obs_nosigsig05 <- round(nrow(subset(merged, (type==0|type==4) & (bv_p>0.05 | sign(bv_b)!=sign(mv_b)) &mv_p <=0.05&(mv_p<bv_p| sign(bv_b)!=sign(mv_b))  )) /n_obs*100,0)  # Switch from nonsignificant to significant in observational studies plus one that changes sign from significant to significant
p_obs_nosigsignoshow <- round(nrow(subset(merged, (type==0|type==4) & (bv_p>0.05 | sign(bv_b)!=sign(mv_b))&(mv_p<bv_p| sign(bv_b)!=sign(mv_b))& no_covar_presented==0  )) /n_obs_noshow*100,0)  # Switch from nonsignificant to significant in observational studies plus one that changes sign from significant to significant
p_obs_nosigsig05noshow <- round(nrow(subset(merged, (type==0|type==4) & (bv_p>0.05 | sign(bv_b)!=sign(mv_b))&mv_p <bv_p& no_covar_presented==0 &mv_p <=0.05  )) /n_obs_noshow*100,0)  # Switch from nonsignificant to significant in observational studies plus one that changes sign from significant to significant
  p_obs_nosigsigm <- round(nrow(subset(merged, (type==0|type==4) & bv_p>0.05 & bv_p-mv_p>.1 | sign(bv_b)!=sign(mv_b)  )) /n_obs*100,0) #Same as above but shift more than 0.1
p_obs_noshow <- round(n_obs_noshow/n_obs*100,0)
#subset(merged, (type==0|type==4) & bv_p>0.05 & mv_p<0.05 | sign(bv_b)!=sign(mv_b) , select = c(author,bv_b,mv_b, bv_p,mv_p, inflate_p))

temp <- merged %>% filter(type==0|type==4) %>%  # Select observational studies (type 0) and natural experiments (type 4)
  mutate(cp = ifelse(no_covar_presented > 0, "Shown", "Bivariate Specification Not Shown"),
         bv_t = bv_b/bv_se, mv_t = mv_b/mv_se, t_mvb_bvse = mv_b/bv_se, t_bvb_mvse = bv_b/mv_se,
         p_mvb_bvse = pnorm(-abs(mv_b/bv_se))*2, 
         p_mvb_bvse2 = ifelse(sign(mv_p-bv_p)!=sign(p_mvb_bvse-bv_p), bv_p, 
                              ifelse(bv_p<mv_p, ifelse(p_mvb_bvse>mv_p, mv_p, p_mvb_bvse), ifelse(p_mvb_bvse<mv_p, mv_p, p_mvb_bvse))), 
         p_bvb_mvse = pnorm(-abs(bv_b/mv_se))*2)
temp$flip <- ifelse(sign(temp$bv_b)!=sign(temp$mv_b), 1, 0)
#temp$bv_p[sign(temp$bv_b)!=sign(temp$mv_b)]<-1
temp$inflate_p <- ifelse(temp$flip==1, 2 - temp$bv_p - temp$mv_p, temp$bv_p - temp$mv_p)
temp1 <- subset(temp, cp=="Shown")
temp1[order(temp1$inflate_p, decreasing=F),"x"]  <- 1:nrow(temp1)
temp2 <- subset(temp, cp=="Bivariate Specification Not Shown")
temp2[order(temp2$inflate_p, decreasing=F),"x"]  <- 1:nrow(temp2)
temp <- rbind(temp1 %>% select(author, x, bv_p, p_end=mv_p, cp, flip, inflate_p) %>% 
                mutate(type="total p-value change from bivariate to full specification"), 
              temp2 %>% select(author, x, bv_p, p_end=mv_p, cp, flip, inflate_p) %>% 
                mutate(type="total p-value change from bivariate to full specification"),
              temp1 %>% select(author, x, bv_p, p_end=p_mvb_bvse2, cp, flip, inflate_p) %>% 
                mutate(type="p-value change from coefficient change"),
              temp2 %>% select(author, x, bv_p, p_end=p_mvb_bvse2, cp, flip, inflate_p) %>% 
                mutate(type="p-value change from coefficient change"))

temp <- rbind(subset(temp, flip==0), 
              subset(temp, flip==1 & type=="total p-value change from bivariate to full specification") %>% 
                arrange(inflate_p) %>%
                mutate(x = c(31, 33, 35, 37), p_end = 1), 
              subset(temp, flip==1 & type=="total p-value change from bivariate to full specification") %>% 
                arrange(inflate_p) %>%
                mutate(x = c(31.5,33.5,35.5, 37.5), bv_p=1),
              subset(temp, flip==1 & type=="p-value change from coefficient change") %>% 
                arrange(inflate_p) %>%
                mutate(x = c(31, 33, 35, 37), p_end = 1), 
              subset(temp, flip==1 & type=="p-value change from coefficient change") %>% 
                arrange(inflate_p) %>%
                mutate(x = c(31.5,33.5,35.5, 37.5), bv_p=1))

##new calculation##
count_pval_coeffchange_notshown <- temp %>% filter(type=="p-value change from coefficient change" & bv_p >0.05 & p_end<0.05 & cp=="Bivariate Specification Not Shown") %>% nrow()
count_pval_coeffchange_shown <- temp %>% filter(type=="p-value change from coefficient change" & bv_p >0.05 & p_end<0.05 & cp=="Shown") %>% nrow()

```

Imagine you are reading a study that reports a statistically significant finding and includes several control variables in the regression model. You learn that this finding only emerges with the addition of these control variables and that adding them increases the effect size estimate, shifting it from nonsignificance to significance. After learning this fact, you will want reassurance that researchers have a strong explanation for the control-variable-induced increase in estimated effect size, sometimes called a suppression effect. You will especially want reassurance when the statistical significance of the key finding depends on this suppression effect. To quote an introductory statistics textbook: “In fact, it is pretty much standard operating procedure that when suppressors arise most researchers dismiss the finding as a statistical artifact unless there is a very strong theoretical explanation for the result” [@bobko_correlation_2001, 254]. Likewise, Andrew Gelman and his co-authors write: “Suffice it to say that, generally, suppression effects are considered statistical artifacts unless there is a strong theoretical explanation for their occurrence” [@crede_questionable_2016]. 

In this paper, we investigate how often studies depend on suppression effects. While many definitions exist, we define suppression effects as increases in the key coefficient estimate that result from adding control variables. This definition corresponds with Conger's [-@conger_revised_1974, 36-37] expansive and generally accepted definition of a suppressor variable: "a variable which increases the predictive validity of another variable (or set of variables) by its inclusion in a regression equation," where predictive validity is assessed by the magnitude of the regression coefficient [@tzelgov_suppression_1991;@mackinnon_equivalence_2000]. This definition includes reciprocal suppression or cooperative suppression, which is a term some researchers use for the most familiar form of suppression, where control variables increase the magnitude of the key effect but do not change the sign. It also includes the somewhat less common case where the key coefficient switches signs, sometimes called negative or net suppression.[^oem] Finally, it includes the rare classical suppression, where the suppressor variable is nearly uncorrelated with the dependent variable [@horst_role_1941]. Several sources formally define and discuss types of suppression [@nickerson_simpson_2019; @conger_revised_1974;@lewis_suppression_1986;@tzelgov_suppression_1991]. 
	
[^oem]: We acknowledge that negative suppression may not always fall under Conger's definition, such as when a coefficient estimate changes signs but is smaller in absolute magnitude. We continue to call such cases suppression.

Although suppression effects should be treated with scrutiny, they can be justified when researchers have a strong explanation for their existence, that is, a strong explanation for why their key effect estimate is suppressed. Tesler and Sears [-@tesler_obama_2010, chapter 6] provide an example in the 2008 US presidential election. To their surprise, they found that white Democrats with "gender traditional" attitudes tended to choose Hillary Clinton over Barack Obama in the Democratic primary, even though Hillary Clinton was a feminist icon. When they controlled for racial resentment, however, the relationship between gender traditionalism and vote choice switched: now the gender traditionalists were less likely to support Clinton. Why? They show that gender traditionalism and racial resentment correlated and that racial resentment much more strongly predicted vote choice. Racial resentment therefore suppressed the positive association between gender traditionalism and opposing Clinton. When controlled for, the expected relationship emerged. In this case, the authors provided a plausible explanation for suppression.

Even though suppression effects can be justified, they deserve greater scrutiny for several reasons. The first is robustness. When researchers have a strong bivariate relationship, one that holds up with every control they can throw at it, they feel confident about it---it seems robust. When a finding depends on suppression effects, however, it is not robust to alternative model specifications---it is by definition acutely sensitive to the specification, as in the Tesler and Sears example. When readers are evaluating statistical findings, they generally want to know which of these two worlds they’re in: generally robust to model specification or generally not. If the latter, they require much greater confidence in the model specification that generates the findings. 
	
A second reason for skepticism of suppression effects is that they can introduce bias that favors the authors in a nontransparent way. Consider an example of a data generating process with one key variable and 10 control variables, the first five of which increase the key effect estimate, and the remaining five decreases the key effect estimate. As a researcher adds these controls sequentially to a model, the first five will increasingly bias their key effect estimate upwards, as these are omitted suppressors. They induce bias because they unmask bias from the still omitted five control variables. Only when the researcher adds all 10 will she estimate an unbiased effect. In this example, a researcher could include the first five control variables, but leave out some of the second five, biasing the effect estimate upwards. Since readers never know the "true" data generating process, they may not realize that control variables have been excluded. Researchers have strong incentives to publish statistically significant findings and, given the numerous forking paths they face when making research decisions about control variables, they may have many opportunities to intentionally or unintentionally take advantage of suppression effects. While all control variables deserve scrutiny, suppression effects can induce bias that favors the authors, and do so in a way that may be hidden from readers, and so arguably deserve greater scrutiny. In contrast, when control variables reduce estimated effect sizes, they work against authors’ publishing incentives and so may require less scrutiny.[^1] 

[^1]: We quote additional authors on why researchers should scrutinize suppression effects in the supporting information. The idea that controls can introduce biases is unintuitive to some readers. This hypothetical example illustrates that they can when some but not all controls are included. Numerous articles have discussed bias introduced by controls [@bhattacharya_instrumental_2012;@clarke_phantom_2005;@cole_illustrating_2010;@middleton_bias_2016;@pearl_class_2010;@pearl_invited_2011;@pearl_causality_2009;@wooldridge_should_2016;@wyss_reducing_2014]. 

A third reason for skepticism is that control variables can introduce bias through opaque mechanisms. The hypothetical example above illustrates the potential for bias that favors authors from confounding. But control variables can also introduce bias that favors authors through amplification bias or through mediation, to name just two [@middleton_bias_2016;@pearl_class_2010;@mackinnon_equivalence_2000]. Given the complexity of some of these effects, readers need to know if findings could potentially depend on them.

As a reader, you therefore want to be alerted to suppression effects, especially when the statistical significance of the key finding depends on them. As a field, we want to know how often leading journals are publishing such articles and especially how often they are doing so without alerting readers, reviewers, and editors. 

So, how often do articles depend on undisclosed suppression effects for statistical significance? In this paper, we investigate this question by analyzing replication data from a leading journal. We examine whether findings depend on suppression effects by asking whether the main estimate presented in the articles, which includes control variables, is larger an absolute value than a bivariate estimate, which excludes those controls. Specifically, we look for how often statistical significance of the articles key result depends on the increase in estimated effect size induced by control variables, that is, suppression effects. We find a startling result: over 30% of observational studies depend on suppression effects for the statistical significance of their findings. Moreover, we find that none of these studies disclose this fact. 

These findings may point to a gap in the review process: journals publish articles that depend on suppression effects for statistical significance without the awareness of readers, reviewers, or editors. They also reinforce concerns about the potential for researcher discretion with control variables, concerns that have existed for decades [@leamer_let_1983]. Replication efforts and meta-analyses suggest that a reasonably large fraction of studies are false positives or report much larger effects than actually exist [@ioannidis_why_2005;@ioannidis_power_2017;@klein_investigating_2014;@klein_many_2018]. They have also pointed to suspicious patterns in test statistics [@gerber_publication_2010;@gerber_statistical_2008;@brodeur_star_2016]. Undisclosed suppression effects may be one source of these patterns. Many solutions exist for this problem, some of which we discuss in our conclusion. The simplest solution, however, is for reviewers and editors to ask for a tad more transparency.

# Data 
To conduct this analysis, we replicate and reanalyze studies published in the _American Journal of Political Science_ (AJPS). AJPS was one of the earliest social science journals to adopt a firm data transparency policy, enforcing the posting of replication data and code for articles published beginning in 2013. Following our pre-analysis plan,[^2] we analyze articles from _AJPS_ 2013-2015 that focus mainly on establishing a single causal claim and have a standard statistical model with at least one control variable (see Table S1 in the supporting information for exclusion reasons). `r nrow(bv_mv)` of 163 articles in these years met these criteria. `r n_obs` of these are observational and `r n_experiment` are experimental. The topics of these studies range widely, from the effects of judges having daughters on their rulings to whether United Nations peacekeepers succeed in protecting civilians. On average, these studies include `r round(avg_num_covar,1)` control variables in the fully specified model.

[^2]: See this link: https://bit.ly/2luoV3E. The plan specifies how we collected the data. We did not pre-specify the key analyses in this paper (Figure 3). We present the analyses we did specify in the supporting information and the results are consistent with our main findings.

We successfully reproduce the main findings in each of these articles, including the coefficient estimates and the standard errors, as Figure 1 shows. Given the difficulties researchers have faced with reproducibility [@king_replication_1995], this result is reassuring. In a handful of cases, the figure reveals that we estimate different standard errors.



```{r, echo=F, message=FALSE, warning=FALSE,fig.cap="Replication of Key Estimates and Standard Errors for Full Specification Models"}
p1 <- ggplot(merged) + geom_abline(slope=1, lty='dashed', col='grey') + geom_point(aes(x=log(abs(coeff_presented)), y=log(abs(mv_b))), size=.5) + theme_classic() + labs(x="log Published Estimate", y="log Replicated Estimate") + coord_fixed(ratio = 1)
p2 <- ggplot(merged) + geom_abline(slope=1, lty='dashed', col='grey') + geom_point(aes(x=log(abs(se_presented)), y=log(abs(mv_se))), size=.5) + theme_classic() + labs(x="log Published Standard Error", y="log Replicated Standard Error") + xlim(-6,1) + ylim(-6,1)+ coord_fixed(ratio = 1)
grid.arrange(p1, p2, ncol = 2, nrow = 1)
```



```{r, message=FALSE, warning=FALSE, include=FALSE}
#All p-value change
# difference between shown and not shown deflation
summary(lm(inflate_p ~ notshown, data=merged %>% filter(type %in% c(0,4)) %>% mutate(notshown = ifelse(no_covar_presented > 0, 0, 1) )  ))
wilcox.test(inflate_p ~ notshown, data=merged %>% filter(type %in% c(0,4)) %>% mutate(notshown = ifelse(no_covar_presented > 0, 0, 1) ))
t1<-t.test(inflate_p ~ notshown, data=merged %>% filter(type %in% c(0,4)) %>% mutate(notshown = ifelse(no_covar_presented > 0, 0, 1) )  )
pvchnoshow<-round( abs(t1$estimate[2]),digits = 2) 
pvchshow<-round(t1$estimate[1],digits = 3)
t1<-wilcox.test(inflate_p ~ notshown, data=merged %>% filter(type %in% c(0,4)) %>% mutate(notshown = ifelse(no_covar_presented > 0, 0, 1) ))
options(scipen=999)
t1<-round(t1$p.value,digits = 4)
# excluding sign flips
t2<-wilcox.test(inflate_p ~ shown, data=merged %>% filter(type %in% c(0,4)) %>% filter(sign(bv_b)==sign(mv_b)) %>% 
         mutate(shown = ifelse(no_covar_presented > 0, 0, 1)))
t2<-round(t2$p.value,digits = 4)


#P-value change only from coefficient change
# difference between shown and not shown deflation only from coefficient chagne
pcc<-merged %>% mutate(bv_t = bv_b/bv_se, mv_t = mv_b/mv_se, t_mvb_bvse = mv_b/bv_se, t_bvb_mvse = bv_b/mv_se,
         cp = ifelse(no_covar_presented > 0, "Shown", "Bivariate Specification Not Shown"),
         p_mvb_bvse = pnorm(-abs(mv_b/bv_se))*2, 
         p_mvb_bvse2 = ifelse(sign(mv_p-bv_p)!=sign(p_mvb_bvse-bv_p), bv_p, 
                       ifelse(bv_p<mv_p, ifelse(p_mvb_bvse>mv_p, mv_p, p_mvb_bvse), 
                       ifelse(p_mvb_bvse<mv_p, mv_p, p_mvb_bvse))), 
         p_bvb_mvse = pnorm(-abs(bv_b/mv_se))*2,
         flip <- ifelse(sign(bv_b)!=sign(mv_b), 1, 0),
         p_end=p_mvb_bvse2,
         inflate_p_coef =ifelse(flip==1,(p_end-1)-(1-bv_p),p_end-bv_p),
         notshown = ifelse(no_covar_presented > 0, 0, 1) ) 

summary(lm(inflate_p_coef~notshown,data=pcc%>% filter(type %in% c(0,4)) ))
t3<-t.test(inflate_p_coef~notshown,data=pcc%>% filter(type %in% c(0,4)) )
copvchnoshow<-round( abs(t3$estimate[2]+t3$estimate[1]),digits = 2) 
copvchshow<-round(t3$estimate[1],digits = 3)
wilcox.test(inflate_p_coef~notshown,data=pcc%>% filter(type %in% c(0,4)) )
t3<-wilcox.test(inflate_p_coef~notshown,data=pcc%>% filter(type %in% c(0,4)) )
t3<-round(t3$p.value,digits = 5)
# excluding sign flips
t.test(inflate_p ~ shown, data=pcc %>% filter(type %in% c(0,4)) %>% filter(sign(bv_b)==sign(mv_b)) %>% 
         mutate(shown = ifelse(no_covar_presented > 0, 0, 1)))
t4<-wilcox.test(inflate_p ~ shown, data=pcc %>% filter(type %in% c(0,4)) %>% filter(sign(bv_b)==sign(mv_b)) %>% 
         mutate(shown = ifelse(no_covar_presented > 0, 0, 1)))
t4<-round(t4$p.value,digits = 4)

# pcc%>% filter(type %in% c(0,4) & notshown==1) %>% with(plot(inflate_p_coef,inflate_p))
# pcc%>% filter(type %in% c(0,4) & notshown==1) %>% with(mean(inflate_p_coef))
# pcc%>% filter(type %in% c(0,4) & notshown==1) %>% with(sqrt(var(inflate_p_coef)/length(inflate_p_coef)))
# pcc = pcc %>% filter(type %in% c(0,4) & notshown==1) %>%select(author, no_covar_presented,notshown, inflate_p, inflate_p_coef)
# se <- sqrt(var(x)/length(x))
```

# Frequency and Disclosure of Suppression Effects in Observational Studies
We first examine how often researchers use control variables to achieve statistical significance through suppression effects in observational studies (we examine experimental studies later). For each article, we examine the effects of control variables on the estimated effect of the key explanatory variable, the variable about which the article attempts to establish a causal claim. To do so, we compare the p-value of the key explanatory variable estimate in the full model (all control variables) to a bivariate model. In cases where the research design or identification strategy requires covariates (such as the main effects of an interaction term, lagged dependent variables, and fixed effects), we include these and count this as a bivariate specification. About half of studies include some covariates in the ``bivariate specification,’’ but our main finding holds up when we look only at studies that did not require them (see Figure 6 below and Figures S1, S2, and S4 in the supporting information). 
	
Given the nature of our analysis---examining the effects of whatever control variables authors used---our definition of suppression effects based on Cogner is expansive. Our definition includes all types of suppression effects generated through any mechanism.  Conger's definition refers to a single suppressor but we are examining the effects of one or more control variables---whatever the original authors used---and therefore apply the concept to the general linear model originally suggested by Holling [-@holling_suppressor_1983]. Although our analysis does not examine the types or mechanisms giving rise to suppression, researchers' analysis should depend on the nature of the suppression effect, a topic we return to later.
	
Since we are especially interested in how often findings depend on undisclosed suppression effects, we examine the effect of controls separately among articles that disclose or do not disclose a bivariate estimate for their key effect. When researchers reveal a bivariate estimate, readers can determine whether the key result depends on suppression effects, justified or otherwise. To classify whether researchers disclosed, we count studies as revealing the potential presence of suppression effects if they report a bivariate estimate for their key explanatory variable. We count this if they report it in a statistical model or in some other form, such as a scatterplot, cross-tab, or difference in means. We count these regardless of whether they occur in figures, tables, the text, or in notes. 

Figures 2 and 3 present the main finding of this paper.[^ody] Figure 2 begins by showing the effect of control variables on the p-values of the key effect estimates and how often they disclose this fact. The figure presents arrows that start at the p-value of the bivariate specification and end at the p-value of the full specification. When control variables lower p-values of the key estimate, the arrows end pointing downwards. When control variables increase p-values, the arrows end pointing upwards. The figure shows a large number of downward arrows. When researchers include control variables, their p-values drop a lot, many from well above 0.05 to below 0.05. In four cases, the introduction of controls switches the estimated effect direction. To convey this switch, the figure shows the arrows increasing from the bivariate specification p-value, to p=1, and then back down to the p-value for the full specification. 

[^ody]: Given the complexity of the findings we present, we show the same results with several alternative figure designs in the supporting information. See Figures S6 and S7a-S7d (we thank an anonymous reviewer for the suggestions).

These drops in p-values only represent suppression effects if they result from increases in the absolute value of coefficients (as the decreases could also result from smaller standard errors). Figure 3 adds the additional information of how much of these p-value drops result from coefficient changes. It presents the key result of this paper. To show how much of these p-value drops results from coefficient changes, Figure 3 alters the arrows so that the solid portion now shows the changes in the p-value attributable only to coefficient changes, that is, the change attributable to suppression effects. The dotted component of the arrows shows the p-value changes attributable to only standard error changes. As the solid portion of the arrows makes clear, suppression effects drive much of the p-value decreases, though both components contribute.  Based on a careful examination of Figure 3, suppression effects helped the authors lower their p-values towards 0.05 in about 20 of the `r n_obs` observational articles. Not all of these studies achieved statistical significance at p<0.05 in the full specification, but the decreases in p-values may have been necessary for publication.

Figures 2 and 3 split the results into the left panel showing studies that did not present a bivariate specification and the right panel showing studies that did. Only a single one of the 20 studies revealed the presence of suppression effects.

In sum, undisclosed suppression effects are common. We cannot know how many of the 19 studies that did not disclose suppression would have been published without the contribution from suppression effects, but it seems possible that many or even all would not have. So, undisclosed suppression effects may have contributed to the publication of as much as (19/`r n_obs` $\approx$) 40% of observational studies. A conservative estimate would be (15/`r n_obs` 	$\approx$) 30%. Another way of describing this finding is that, when reading an observational article that does not show a bivariate estimate in some form, readers should assume as high as a (19/34 $\approx$) 55% chance that it depends on suppression effects for significance, a high probability.

```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.cap="P-Value Changes in Observational Studies. Arrows for each study show the total p-value changes from the bivariate to the full specification (all controls). The figure shows that, when articles did not present a bivariate specification, control variables lowered p-values for the articles' key effect estimate. The four up and down arrows reflect cases where the estimate changed sign from the bivariate to the full specification. The next figure shows how much of the p-value drops result from changes in coefficients from suppression effects versus changes in standard errors."}

ggplot() + 
  geom_segment(data=temp %>% filter(type=="total p-value change from bivariate to full specification"), aes(x=x,y=bv_p,yend=p_end,xend=x, linetype='longdash'), inherit.aes=FALSE, arrow = arrow(length=unit(.06, 'inches'), type='closed')) + 
facet_grid(~ cp, scale="free_x", space="free_x") + theme_bw() + labs(x="", y="p-value") + geom_hline(yintercept=0.05, lty='dashed', col='grey') + theme(axis.ticks.x = element_blank(), axis.text.x = element_blank(), legend.title = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.position = 'none', legend.text=element_text(size=7)) + scale_shape(solid = FALSE) + scale_fill_discrete(guide = guide_legend(reverse=TRUE)) 
```

```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.cap="P-Value Changes in Observational Studies from Coefficient and Standard Error Change. Arrows for each study show the total p-value changes from the bivariate to the full specification (all controls). The solid part of the arrows shows the p-value changes from only coefficient estimate changes, while the dotted part shows the remaining p-value changes from standard error changes. The figure shows that, when articles failed to present a bivariate specification, they often depend on suppression effects to achieve statistical significance. The four up and down arrows reflect cases where the estimate changed sign from the bivariate to the full specification. Figure 6 and Figures S1 and S2 in the supporting information present robustness checks."}


ggplot() + 
  geom_segment(data=temp %>% filter(type=="total p-value change from bivariate to full specification"), aes(x=x,y=bv_p,yend=p_end,xend=x, linetype='longdash'), inherit.aes=FALSE, arrow = arrow(length=unit(.06, 'inches'), type='closed')) + 
  geom_segment(data=temp %>% filter(type=="p-value change from coefficient change"), aes(x=x,y=bv_p,yend=p_end,xend=x, linetype='solid'), inherit.aes=FALSE) + 
  facet_grid(~ cp, scale="free_x", space="free_x") + theme_bw() + labs(x="", y="p-value") + geom_hline(yintercept=0.05, lty='dashed', col='grey') + theme(axis.ticks.x = element_blank(), axis.text.x = element_blank(), legend.title = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.position = c(.25, .9), legend.text=element_text(size=7)) + scale_shape(solid = FALSE) + scale_fill_discrete(guide = guide_legend(reverse=TRUE)) + 
  scale_linetype_manual(values=c("dashed", "solid"), labels=c("p-value change from reducing standard errors", "p-value change from inflating coefficient"))

```

The effect of suppression on p-values is considerable, especially in studies that do not reveal the presence of suppression effects. In studies that are transparent by showing a bivariate, p-values actually slightly increase (by `r pvchshow`). In studies that are not transparent, p-values decrease by `r pvchnoshow` on average. The difference in p-value changes between those that do and do not is highly statistically significant (p=`r t1`, Wilcoxon rank-sum test). If we limit the p-value changes to only that due to coefficient changes, non-transparent studies decrease them by `r copvchnoshow` on average and the difference in p-value changes remains highly statistically significant (p=`r t3`, Wilcoxon rank-sum test).

To show how control variables give rise to the p-value changes in Figures 2 and 3, Figure 4 presents the log change in coefficients and standard errors of key variables from the bivariate to the multivariate. It uses log changes because they approximate percent changes (while avoiding the symmetry and additivity problems of percent changes).[^cba] Each dot shows either the log change in coefficients or in standard errors, and we sort dots by the p-value changes, as in Figures 2 and 3. Figure 4 makes clear that suppression effects occur in several cases where they have no impact on the statistical significance of studies. In such cases, suppression effects, even if not justified, are less problematic since statistical significance does not depend on them. The figure also makes clear that even small changes in effect sizes can have large effects on statistical significance. Finally, this figure illustrates the degree to which increased precision, a frequent justification of the inclusion of control variables, isn't terribly common.  

[^cba]: To calculate the log change, we code the multivariate and bivariate estimates to positive. In the four cases where the signs flip between the bivariate and the multivariate, we rescale so that the bivariate estimate is zero and we add the bivariate to the multivariate. We then add one to all the estimates before taking the log. We also add one before calculating the log change in standard errors to keep the scale for coefficients and standard errors similar. Adding constants other than one doesn't substantively change the findings.


```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.cap="Key Coefficient and Standard Error Change in Observational Studies. Each dot shows either the log change in coefficients (top) and standard errors (bottom). The dots are sorted by the p-value changes, the exact same order as in Figures 2 and 3."}
temp <- merged %>% 
  filter(type==0|type==4) %>%  # Select observational studies (type 0) and natural experiments (type 4)
  mutate(cp = ifelse(no_covar_presented > 0, "Shown", "Bivariate Specification Not Shown"),
         bv_t = bv_b/bv_se, 
         mv_t = mv_b/mv_se, 
         t_mvb_bvse = mv_b/bv_se, 
         t_bvb_mvse = bv_b/mv_se,
         p_mvb_bvse = pnorm(-abs(mv_b/bv_se))*2, 
         p_bvb_mvse = pnorm(-abs(bv_b/mv_se))*2,
         flip = ifelse(sign(bv_b)!=sign(mv_b), 1, 0),
         inflate_p = ifelse(flip==1, 1, bv_p - mv_p))

temp1 <- subset(temp, cp=="Shown")
temp1[order(temp1$inflate_p, decreasing=F),"x"]  <- 1:nrow(temp1)
temp2 <- subset(temp, cp=="Bivariate Specification Not Shown")
temp2[order(temp2$inflate_p, decreasing=F),"x"]  <- 1:nrow(temp2)
temp <- rbind(temp1 %>% 
                select(author, x, y=inflate, cp, flip) %>% 
                mutate(ylabel="Coefficient log change"), 
              temp2 %>% 
                select(author, x, y=inflate, cp, flip) %>% 
                mutate(ylabel="Coefficient log change"),
              temp1 %>% 
                select(author, x, y=inflate_se, cp, flip) %>% 
                mutate(ylabel="Standard error log change"),
              temp2 %>% 
                select(author, x, y=inflate_se, cp, flip) %>% 
                mutate(ylabel="Standard error log change"))

ggplot(temp) + geom_hline(yintercept=0, lty='dashed', col='grey') + geom_point(aes(x=x, y=y), inherit.aes=FALSE) + facet_grid(ylabel ~ cp, scale="free", space="free_x") + theme_bw() + labs(x="", y="")  + theme(axis.ticks.x = element_blank(), axis.text.x = element_blank(), legend.title = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.position = c(.25, .9), legend.text=element_text(size=7)) + scale_shape(solid = FALSE) + scale_color_manual(values=c("black", "grey")) + scale_fill_discrete(guide = guide_legend(reverse=TRUE))
```

How many of the undisclosed cases of suppression represent false positives where? That is, how many of these are cases where researchers are relying on suppression effects, intentionally or not, to publish nonexistent effects? This is a hard question. We spent considerable time reading and reanalyzing replication data for these articles, trying to determine whether suppression effects could be justified, that is, whether researchers could have provided a good explanation for their key effect estimates being suppressed.

In some cases, we think they do. One example is Davenport [-@davenport_policyinduced_2015], who examines the effect of casualties and low draft numbers on parents' turnout during the Vietnam War. She finds that parents whose children are at high risk of being drafted (low lottery numbers) and who live in towns that have casualties, are more likely to turn out. One can tell a simple story about suppressor variables: poor regions of the country have lower turnout and disproportionately contribute soldiers to combat roles in Vietnam, so are hit disproportionately by casualties. Socio-economic status may therefore suppress Davenport's key interaction estimate. Indeed, in reanalyzing Davenport's data, we find that prior turnout (which likely captures individual socio-economic status) and town measures of socio-economic status increase the size of her key interaction estimate, lowering its p-value from `r round(bv_mv[which(bv_mv=='Davenport'),'bv_p'], digit=2)` in the bivariate specification to `r round(bv_mv[which(bv_mv=='Davenport'),'mv_p'], digit=2)` in the full specification. So, Davenport can provide a good explanation for why her finding depends on suppression effects.

For many articles, however, we cannot find an obvious justification for suppression effects, though it is possible that authors could provide one. In several of these articles, unusual control variables were the key suppressors. These include variables that arguably should not be in the models, such as post-treatment variables, and variables that change the interpretation of the finding in a way that seems unintended by the authors, such as including a lagged dependent variable when the article is not about explaining change. These potentially questionable control variables would have been harder to miss in the review process if authors had alerted readers to the presence of suppression effects.
	
```{r, message=FALSE, warning=FALSE, include=FALSE}	
# difference between shown and not shown deflation
summary(lm(inflate_p ~ observational, data=merged %>% mutate(observational = ifelse(type==0|type==4, 1, 0))))
t4<-t.test(inflate_p ~ observational, data=merged %>% mutate(observational = ifelse(type==0|type==4, 1, 0)))
t4<-round(t4$p.value,digits = 3)

med.n<-median(merged$mv_n[merged$type==1 | merged$type==2| merged$type==3])
min.n<-min(merged$mv_n[merged$type==1 | merged$type==2| merged$type==3])

```

# Frequency and Disclosure of Suppression Effects in Experimental Studies

Suppression effects should be less common in experimental studies that randomly assign treatment in large samples. In observational studies, controls can change the key variable's estimate because they can correlate with the dependent variable and key independent variable. When researchers randomly assign the treatment in large samples, by contrast, the correlation with controls is minimal, especially as the sample size increases. Therefore, we would expect minimal sensitivity to control variable choice. All the experiments we examine here used random assignment and have relatively large Ns; the median N is `r med.n` and smallest is `r min.n`. Consistent with this expectation, Figure 5 reveals little sign of p-value decreases in experimental studies with controls. Randomized experiments have many advantages---one is less vulnerability to discretion in control variable choice.

```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.cap="P-Value Changes in Experimental Studies. For each article, the arrows show the total p-value changes from the bivariate to the full specification. The solid part of the arrows shows the p-value changes from only coefficient estimate changes, while the dotted part shows p-value changes from standard error changes."}
	
temp <- merged %>% filter(type %in% 1:3) %>%  # Select experimental studies (survey, field, lab)
  mutate(cp = ifelse(no_covar_presented > 0, "Bivariate Specification Shown", "Not Shown"),
         bv_t = bv_b/bv_se, mv_t = mv_b/mv_se, t_mvb_bvse = mv_b/bv_se, t_bvb_mvse = bv_b/mv_se,
         p_mvb_bvse = pnorm(-abs(mv_b/bv_se))*2, 
         p_mvb_bvse2 = ifelse(sign(mv_p-bv_p)!=sign(p_mvb_bvse-bv_p), bv_p, 
                              ifelse(bv_p<mv_p, ifelse(p_mvb_bvse>mv_p, mv_p, p_mvb_bvse), ifelse(p_mvb_bvse<mv_p, mv_p, p_mvb_bvse))), 
         p_bvb_mvse = pnorm(-abs(bv_b/mv_se))*2)
temp$flip <- ifelse(sign(temp$bv_b)!=sign(temp$mv_b), 1, 0)
#temp$bv_p[sign(temp$bv_b)!=sign(temp$mv_b)]<-1
temp$inflate_p <- ifelse(temp$flip==1, 1, temp$bv_p - temp$mv_p)
temp1 <- subset(temp, cp=="Bivariate Specification Shown")
temp1[order(temp1$inflate_p, decreasing=F),"x"]  <- 1:nrow(temp1)
temp2 <- subset(temp, cp=="Not Shown")
temp2[order(temp2$inflate_p, decreasing=F),"x"]  <- 1:nrow(temp2)
temp <- rbind(temp1 %>% select(author, x, bv_p, p_end=mv_p, cp, flip) %>% mutate(type="total p-value change from bivariate to full specification"), 
              temp2 %>% select(author, x, bv_p, p_end=mv_p, cp, flip) %>% mutate(type="total p-value change from bivariate to full specification"),
              temp1 %>% select(author, x, bv_p, p_end=p_mvb_bvse2, cp, flip) %>% mutate(type="p-value change from coefficient change"),
              temp2 %>% select(author, x, bv_p, p_end=p_mvb_bvse2, cp, flip) %>% mutate(type="p-value change from coefficient change"),
              temp2 %>% select(author, x, bv_p, p_end=p_mvb_bvse2, cp, flip) %>%
                mutate(x=c(0,3), bv_p=-1, p_end=-1, type="p-value change from coefficient change"))

ggplot() + 
  geom_segment(data=temp %>% filter(type=="total p-value change from bivariate to full specification"), aes(x=x,y=bv_p,yend=p_end,xend=x, linetype='longdash'), inherit.aes=FALSE, arrow = arrow(length=unit(.06, 'inches'), type='closed')) + 
  geom_segment(data=temp %>% filter(type=="p-value change from coefficient change"), aes(x=x,y=bv_p,yend=p_end,xend=x, linetype='solid'), inherit.aes=FALSE) + ylim(0,1) +
  facet_grid(~ cp, scale="free_x", space="free_x") + theme_bw() + labs(x="", y="p-value") + geom_hline(yintercept=0.05, lty='dashed', col='grey') + theme(axis.ticks.x = element_blank(), axis.text.x = element_blank(), legend.title = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.position = c(.25, .9), legend.text=element_text(size=7)) + scale_shape(solid = FALSE) + scale_fill_discrete(guide = guide_legend(reverse=TRUE)) + 
  scale_linetype_manual(values=c("dashed", "solid"), labels=c("total p-value change from reducing standard errors", "p-value change from inflating coefficient"))
```

# Robustness and Potential Objections

The main finding of this paper, shown in Figures 2 and 3, is robust: when researchers don't disclose a bivariate, adding control variables drops p-values consistently across various subsets of the data, as shown in Figure 6. P-values drop when researchers use fixed effects or don't use fixed effects and when we exclude the cases where the sign of the effects flip. They drop in each of the three volumes of the _AJPS_, in each of the three empirical subfields (American, comparative, and international relations), and when the bivariate/minimal specification has no covariates versus when we must include covariates for the minimal specification to make sense. Finally, they also drop in studies with less than the median number of controls, which is eight, and in studies with more than the median number of controls (see also Figures S1 and S2 in the supporting information).
	
One potential alternative explanation for our key finding is that researchers may not show the bivariate specification when controls favor them because they are working on topics where suppressor variables are well known. To address this possibility and to examine whether researchers justified suppression effects, we carefully read the sections on controls in these articles but found no acknowledgment or justification. We present each articles' text relating to controls in the supporting information (for articles that did not present a bivariate specification).
	
Another account for our key finding comes from the incentives of the journal review process. Researchers may believe that, if they disclose suppression effects, reviewers will get hung up on them. The pattern we observe---only one published study relies on suppression effects for significance and discloses this fact---is consistent with this concern. If this is authors’ motivation, however, it points to a hole in the review process, as reviewers should be assessing the plausibility of control-induced increases in effect sizes.


```{r, echo=F, message=FALSE, warning=FALSE,fig.cap="P-value Changes in Studies that Do Not Show Bivariate. This plot shows the average change in p-values from the bivariate to the full specification (with 95% confidence intervals) for observational and experimental studies that did not show a bivariate specification. The mean number of controls in these studies is nine and the median is eight. Figures S1 and S2 in the supporting information present similar plots for all studies and for p-value changes only from coefficient changes."}
#amount of p-value change plot for studies that don't show bivariate (by fixed effects, no fixed effects, volume, etc.)
merged_STATA <- inner_join(merged, STATA, by="author")
merged_STATA$fixed_effects.x[is.na(merged_STATA$fixed_effects.x)] <- 0
merged_STATA$flip <- ifelse(sign(merged_STATA$bv_b)!=sign(merged_STATA$mv_b), 1, 0)

merged_STATA$inflate_p <- ifelse(merged_STATA$flip==1, (merged_STATA$mv_p -1) - (1-merged_STATA$bv_p), merged_STATA$mv_p- merged_STATA$bv_p )
merged_STATA$n_covariates <- stringr::str_count(merged_STATA$covariates, " ") + stringr::str_count(merged_STATA$fixed_effects.y, " ")
median_covar <- median(merged_STATA$n_covariates)

inflate.plot.ci <- function(name, dat){
  #tt <- t.test(dat$bv_p, dat$mv_p_flip, paired=TRUE)
  mean <- mean(dat$inflate_p)
  sd <- sd(dat$inflate_p)
  n <- nrow(dat)
  ci <- 1.9 * sd / sqrt(n)
  out <- c(paste0(name," (n=",n,")"), mean, sd, ci)
  names(out) <- c("name", "mean", "sd", "ci")
  return(out)
}
plot_data <- inflate.plot.ci("All", subset(merged_STATA, no_covar_presented==0))
plot_data <- rbind(plot_data, inflate.plot.ci("Observational", subset(merged_STATA, type %in% c(0,4) & no_covar_presented==0)))
plot_data <- rbind(plot_data, inflate.plot.ci("Experimental", subset(merged_STATA, type %in% c(1:3) & no_covar_presented==0)))
plot_data <- rbind(plot_data, inflate.plot.ci("Obs., Fixed Effects", subset(merged_STATA, type %in% c(0,4) & fixed_effects.x>0 & no_covar_presented==0)))
plot_data <- rbind(plot_data, inflate.plot.ci("Obs., No Fixed Effects", subset(merged_STATA, type %in% c(0,4) & fixed_effects.x==0 & no_covar_presented==0)))
plot_data <- rbind(plot_data, inflate.plot.ci("Obs., Excl. Flip-Sign Cases", subset(merged_STATA, type %in% c(0,4) & no_covar_presented==0 & sign(bv_b)==sign(mv_b))))
plot_data <- rbind(plot_data, inflate.plot.ci("Volume 57", subset(merged_STATA, vol==57 & no_covar_presented==0)))
plot_data <- rbind(plot_data, inflate.plot.ci("Volume 58", subset(merged_STATA, vol==58 & no_covar_presented==0)))
plot_data <- rbind(plot_data, inflate.plot.ci("Volume 59", subset(merged_STATA, vol==59 & no_covar_presented==0)))
plot_data <- rbind(plot_data, inflate.plot.ci("American", subset(merged_STATA, subfield=="American" & no_covar_presented==0)))
plot_data <- rbind(plot_data, inflate.plot.ci("Comparative", subset(merged_STATA, subfield=="Comparative" & no_covar_presented==0)))
plot_data <- rbind(plot_data, inflate.plot.ci("IR", subset(merged_STATA, subfield=="IR" & no_covar_presented==0)))
plot_data <- rbind(plot_data, inflate.plot.ci("Bivariate Specification is Bivariate", subset(merged_STATA, secondary_vars=="" & no_covar_presented==0)))
plot_data <- rbind(plot_data, inflate.plot.ci("Bivariate Specification not Bivariate", subset(merged_STATA, secondary_vars!="" & no_covar_presented==0)))
plot_data <- rbind(plot_data, inflate.plot.ci("Below Median Number of Controls", subset(merged_STATA, n_covariates < median_covar)))
plot_data <- rbind(plot_data, inflate.plot.ci("Above Median Number of Controls", subset(merged_STATA, n_covariates >= median_covar)))

plot_data <- as.data.frame(plot_data)
plot_data$mean <- as.numeric(as.character(plot_data$mean))
plot_data$sd <- as.numeric(as.character(plot_data$sd))
plot_data$ci <- as.numeric(as.character(plot_data$ci))
plot_data$x <- nrow(plot_data):1

ggplot(plot_data, aes(x=x, y=mean)) + geom_point(size=2) + geom_linerange(aes(ymin=mean - ci, ymax=mean + ci)) + 
  labs(y="Average p-value change from adding control variables", x="", col=NULL) + 
  theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  geom_hline(yintercept=0, lty=2) + coord_flip(ylim=c(-.6, .1)) + 
  scale_x_continuous(breaks=1:nrow(plot_data), labels=plot_data$name[nrow(plot_data):1]) 
```

Some scholars have argued that showing multiple specifications in regression tables, as researchers often do, is enough, especially when estimates generally seem stable across the multiple specifications. We believe, however, that researchers may underappreciate the degree to which estimates vary across combinations of controls not shown. Researchers often present a table of several regression specifications, sequentially adding controls, but these represent a handful of thousands of possible specifications. We show the range of effects researchers could have produced in Figure 7, which presents the distribution of t-statistics across all possible control combinations (randomly sampled in cases with many control) for the observational studies. We show t-statistics to scale the coefficients across these studies. As before, the figure breaks the studies into those that show a bivariate specification and those that do not. It also shows the t-statistic for the bivariate specification (b) and the full multivariate specification (m). The figure shows a large range of estimates in many of the studies across control combinations, with estimates often crossing the threshold for statistical significance and sometimes producing effects of opposite signs. This figure understates the potential for discretion since it only reflects the controls the authors chose to condition on. And, of course, control variable choice is just one of several sources of modeling discretion.

Another objection we have encountered is that readers do not need bivariate specifications to assess the effects of control variable discretion because they can merely look at the control variables themselves. They can assess, for instance, whether researchers have left out controls that would work against their key finding, suppressing its effect. Undoubtedly, readers can do this in some cases. In our reanalysis of these articles, however, some of the controls responsible for increasing estimated effect sizes are unexpected, and we doubt experts would have anticipated them. 
	
Another objection is that researchers may be unable to justify the effects of controls on their estimates because they are too complicated to explain in a multidimensional context. Although this point has merit, we think that researchers will often have priors about the reasonableness of the net effect of controls. We also find that control variables largely have similar effects on their own when conditioning on other control variables (see Figure S3 in the supporting information). More importantly, if researchers cannot be required to provide justifications, then discretion will be left partly unchecked. 

Finally, it's worth emphasizing that the p-value drops we observe from adding controls do not generally result from large changes in the key coefficient estimates. Although scaling changes across studies is difficult, the key coefficients increase on average by only about 20%, calculated with log +1 changes, but those increases translate into large p-value decreases (see Figures 4 and S4-S5 in the supporting information). Researchers, therefore, are benefiting from suppression, but relatively weak suppression, enough to shift their p-values below 0.05. Consistent with this pattern, multicollinearity does not seem especially high in these studies, nor does it vary with p-value changes. In the supplemental information, we present plots of the average correlation between the key variable and control variables for all studies (see supporting information Figure S8).


```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.height=7, fig.cap="Distribution of t-Statistics across All Possible Control Variable Combinations for Observational Studies (randomly sampled in cases with more than 10,000 combinations, sorted by variance). 'm' shows the full multivariate t-statistic and 'b' the bivariate t-statistic. The grey dotted lines mark t-statistics at -1.96 and 1.96, the p = 0.05 thresholds for conventional statistical significance. We exclude two studies here because of the inordinate time required to estimate each specification."}
plot_data <- left_join(merged, stacked_estimates, by='author')  %>% filter(type==0|type==4) %>%
  mutate(cp = ifelse(no_covar_presented > 0, "Shown", "Bivariate Specification Not Shown"))
test <- as.data.frame(unique(stacked_estimates$author))
names(test) <- "author"
test2 <- anti_join(merged %>% select(author), test)
order <- plot_data %>% group_by(author, cp) %>%	
  summarise(est_var = var(estimate/se, na.rm=T))
order <- order[order(order$cp, order$est_var),] %>% filter(!is.na(est_var))
order$order <- c(1:sum(order$cp=="Bivariate Specification Not Shown"),
                   1:sum(order$cp=="Shown"))
plot_data2 <- left_join(order, plot_data, by=c('author', 'cp'))

ggplot(plot_data2) +
  geom_density_ridges(aes(x = estimate/se, y = order, group = order)) +
  geom_point(aes(x=mv_b/mv_se, y=order), shape="m") +
  geom_point(aes(x=bv_b/bv_se, y=order), shape="b") +
  xlim(-10,10) + labs(x="t-statistic", y="") + theme_classic() +
  facet_grid(cp ~ ., scales='free', space='free') +
  theme(axis.ticks.y = element_blank(), axis.text.y = element_blank(), legend.position = "bottom") +
  geom_vline(xintercept=1.96, col="grey", lty='dashed') +
  geom_vline(xintercept=-1.96, col="grey", lty='dashed') +
  geom_vline(xintercept=0, col="black", lty='dashed') +
  scale_shape_manual(values = c("m", "b"), labels=c("full specification", "bivariate specification"))
```

# Discussion and Conclusion

Based on a reanalysis of political science studies, we find that statistical significance depends on suppression effects in a large fraction of observational studies. Almost none of these studies are transparent about this fact. By increasing the estimates of key effect sizes, undisclosed suppression effects may have contributed to the publication of over 30% of observational studies. When reading an observational article that does not show a bivariate estimate in some form, readers should assume nearly a 60% chance that the article’s key effect depends on suppression for statistical significance. Suppression effects in these articles may be well justified, but authors do not acknowledge, let alone justify them. Without disclosure, readers, reviewers, and editors cannot subject them to scrutiny, asking whether the authors have a good justification for suppression.

It is important to be clear about the argument of this paper. We are in no way claiming that suppression effects necessarily introduce bias. They can be an important part of observational research, capturing data generating processes whose outcomes would otherwise go unnoticed. Instead, we are arguing that, because suppression effects can introduce bias that favors authors, and do so in a nontransparent way, readers typically want to be aware of them. We are also in no way claiming that these 30-40% of articles that rely on suppression effects to achieve statistical significance are false positives. Instead, we are simply pointing out that editors and reviewers would likely have scrutinized the control variables more closely if they knew these articles depended on suppression effects for their statistical significance. That additional scrutiny may or may not change the publication outcomes for these articles. We are also in no way arguing that bivariate estimates are less biased or preferable. Instead, we are arguing that requiring disclosure of bivariates alerts readers to suppression effects and the extent of those effects. Readers want to know about their presence, we think, because they mean that estimates are, by definition, not robust to model specification, more vulnerable to hidden discretion by authors in their choice of controls, and potentially produced by opaque mechanisms such as amplification bias. In contrast, control variables that reduce the magnitude of key estimates can also introduce bias, but this bias works against authors.

The degree to which we find studies depending on suppression effects highlights the problems with research oriented around p-value thresholds [@mcshane_abandon_2019]. When researchers must report findings with p-values below a certain threshold to publish, incentives for them to "find a way" become strong. This leads to numerous perverse incentives, including model selection based on p-values.

As a field, we can limit discretion. Authors can assess robustness to a wider array of control variable choices with Bayesian Model Averaging [@bartels_specification_1997;@montgomery_bayesian_2010;@leamer_svalues_2016] and related techniques. They can also use specification-curve analysis in which they report their key effect estimate across all theoretically justifiable model specifications [@simonsohn_specification_2015]. Researchers can also limit the effects of model discretion by matching before they analyze their data [@hainmueller_entropy_2012;@sekhon_multivariate_2011;@imai_covariate_2014], reducing model extrapolation [@ho_matching_2007a]. They could use hold out samples and apply newly developed estimators for cases where the number of control variables is large [@athey_efficient_2016;@ning_high_2017]. They can also preregister control variables before data collection as part of a pre-analysis plan [@casey_reshaping_2012;@humphreys_fishing_2013]. 
	
Most simply, researchers could disclose the bivariate specification to allow reviewers and readers to assess the effects of adding control variables.[^3] If the bivariate specification departs noticeably from other specifications, authors need to explain why. This is difficult in cases with more than a handful of controls [@achen_new_2002a], though see Gelbach [-@gelbach_when_2016] for a formalization. By disclosing the bivariate, readers can assess whether control variables could be introducing bias that favors the authors. Given the high rate of false positives in published research, readers should know this basic fact.
	
[^3]: Presenting bivariate relationships is important for yet another reason: they may help researchers infer the level of confounding [@RSSB:RSSB12167;@oster_unobservable_2016;@altonji_evaluation_2005]. 

Besides limiting the potential for discretion, researchers should also be attuned to the source of the suppression effects. In some cases, researchers have strong justifications for them, such as in the Tesler and Sears example [-@tesler_obama_2010] and the Davenport [-@davenport_policyinduced_2015] examples discussed earlier. In other cases, however, researchers should consider excluding the suppressor variables, such as when the source is classical suppression, mediation, or amplification bias. 

# Authors' Note

Funding for this research was provided by the Berkeley Initiative for Transparency in the Social Sciences (BITSS), a program of the Center for Effective Global Action (CEGA), with support from the Laura and John Arnold Foundation. For helpful comments, we thank participants at the Midwest Political Science Association 2017 Meeting, the Berkeley Initiative for Transparency in the Social Sciences (BITSS) 2016 Annual Meeting, the Berkeley Research Workshop in American Politics, and Jasjeet Sekhon's methods seminar. We also thank Larry Bartels, Martin Bisgaard, Allan Dafoe, Vicky Fouka, Donald Green, Andy Hall, Jens Hainmueller, Daniel Hopkins, Macartan Humphreys, Greg Huber, Kosuke Imai, Josh Kalla, Martin Vinaes Larsen, Yotam Margalit, John Bullock, Seth Hill, John Marshall, Andrew McCall, David Nield, Mark Pickup, Jasjeet Sekhon, Laura Stoker, and the anonymous reviewers. Replication materials are available from Lenz and Sahn [-@data2020].

